{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Formatting and deduping data\n",
    "\n",
    "Formatting columns and removing duplicates is an important part of data preparation.\n",
    "\n",
    "Preparing data for analysis is a crucial step in any data science project. One aspect of data preparation is formatting columns and removing duplicates. Inaccurate or inconsistent formatting of columns can make it difficult to analyze data or even result in incorrect results. Similarly, duplicate data can skew analysis and lead to inaccurate conclusions. \n",
    "\n",
    "This notebook will explore how to format columns in Pandas dataframes to ensure data accuracy and consistency. We will also discuss detecting and removing duplicate data and handling missing values in columns. These techniques ensure data is adequately prepared for analysis and modelling, leading to more accurate and reliable results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How To"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"data/housing.csv\", dtype={\"housing_median_age\": int,\"ocean_proximity\": \"category\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "longitude              float64\n",
       "latitude               float64\n",
       "housing_median_age       int32\n",
       "total_rooms            float64\n",
       "total_bedrooms         float64\n",
       "population             float64\n",
       "households             float64\n",
       "median_income          float64\n",
       "median_house_value     float64\n",
       "ocean_proximity       category\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>longitude</th>\n",
       "      <th>latitude</th>\n",
       "      <th>housing_median_age</th>\n",
       "      <th>total_rooms</th>\n",
       "      <th>total_bedrooms</th>\n",
       "      <th>population</th>\n",
       "      <th>households</th>\n",
       "      <th>median_income</th>\n",
       "      <th>median_house_value</th>\n",
       "      <th>ocean_proximity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-122.23</td>\n",
       "      <td>37.88</td>\n",
       "      <td>41</td>\n",
       "      <td>880.0</td>\n",
       "      <td>129.0</td>\n",
       "      <td>322.0</td>\n",
       "      <td>126.0</td>\n",
       "      <td>8.3252</td>\n",
       "      <td>452600.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-122.22</td>\n",
       "      <td>37.86</td>\n",
       "      <td>21</td>\n",
       "      <td>7099.0</td>\n",
       "      <td>1106.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>1138.0</td>\n",
       "      <td>8.3014</td>\n",
       "      <td>358500.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-122.24</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52</td>\n",
       "      <td>1467.0</td>\n",
       "      <td>190.0</td>\n",
       "      <td>496.0</td>\n",
       "      <td>177.0</td>\n",
       "      <td>7.2574</td>\n",
       "      <td>352100.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-122.25</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52</td>\n",
       "      <td>1274.0</td>\n",
       "      <td>235.0</td>\n",
       "      <td>558.0</td>\n",
       "      <td>219.0</td>\n",
       "      <td>5.6431</td>\n",
       "      <td>341300.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-122.25</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52</td>\n",
       "      <td>1627.0</td>\n",
       "      <td>280.0</td>\n",
       "      <td>565.0</td>\n",
       "      <td>259.0</td>\n",
       "      <td>3.8462</td>\n",
       "      <td>342200.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \\\n",
       "0    -122.23     37.88                  41        880.0           129.0   \n",
       "1    -122.22     37.86                  21       7099.0          1106.0   \n",
       "2    -122.24     37.85                  52       1467.0           190.0   \n",
       "3    -122.25     37.85                  52       1274.0           235.0   \n",
       "4    -122.25     37.85                  52       1627.0           280.0   \n",
       "\n",
       "   population  households  median_income  median_house_value ocean_proximity  \n",
       "0       322.0       126.0         8.3252            452600.0        NEAR BAY  \n",
       "1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  \n",
       "2       496.0       177.0         7.2574            352100.0        NEAR BAY  \n",
       "3       558.0       219.0         5.6431            341300.0        NEAR BAY  \n",
       "4       565.0       259.0         3.8462            342200.0        NEAR BAY  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "int_cols = [\"total_rooms\", \"population\", \"households\", \"median_house_value\"]\n",
    "df[int_cols] = df[int_cols].astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "longitude              float64\n",
       "latitude               float64\n",
       "housing_median_age       int32\n",
       "total_rooms              int32\n",
       "total_bedrooms         float64\n",
       "population               int32\n",
       "households               int32\n",
       "median_income          float64\n",
       "median_house_value       int32\n",
       "ocean_proximity       category\n",
       "dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## De-duplicating data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        False\n",
       "1        False\n",
       "2        False\n",
       "3        False\n",
       "4        False\n",
       "         ...  \n",
       "20635    False\n",
       "20636    False\n",
       "20637    False\n",
       "20638    False\n",
       "20639    False\n",
       "Length: 20640, dtype: bool"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.duplicated()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         True\n",
       "1         True\n",
       "2         True\n",
       "3         True\n",
       "4         True\n",
       "         ...  \n",
       "20635     True\n",
       "20636     True\n",
       "20637     True\n",
       "20638     True\n",
       "20639    False\n",
       "Name: ocean_proximity, Length: 20640, dtype: bool"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.ocean_proximity.duplicated(\"last\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        False\n",
       "1        False\n",
       "2        False\n",
       "3        False\n",
       "4        False\n",
       "         ...  \n",
       "3041      True\n",
       "19520     True\n",
       "22        True\n",
       "1215      True\n",
       "8840      True\n",
       "Length: 20645, dtype: bool"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.append(df.sample(5)).duplicated()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_dup = df.append(df.sample(5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        False\n",
       "1        False\n",
       "2        False\n",
       "3        False\n",
       "4        False\n",
       "         ...  \n",
       "14693     True\n",
       "14572     True\n",
       "7974      True\n",
       "4140      True\n",
       "20393     True\n",
       "Length: 20645, dtype: bool"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dup.duplicated()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>longitude</th>\n",
       "      <th>latitude</th>\n",
       "      <th>housing_median_age</th>\n",
       "      <th>total_rooms</th>\n",
       "      <th>total_bedrooms</th>\n",
       "      <th>population</th>\n",
       "      <th>households</th>\n",
       "      <th>median_income</th>\n",
       "      <th>median_house_value</th>\n",
       "      <th>ocean_proximity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-122.23</td>\n",
       "      <td>37.88</td>\n",
       "      <td>41</td>\n",
       "      <td>880</td>\n",
       "      <td>129.0</td>\n",
       "      <td>322</td>\n",
       "      <td>126</td>\n",
       "      <td>8.3252</td>\n",
       "      <td>452600</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-122.22</td>\n",
       "      <td>37.86</td>\n",
       "      <td>21</td>\n",
       "      <td>7099</td>\n",
       "      <td>1106.0</td>\n",
       "      <td>2401</td>\n",
       "      <td>1138</td>\n",
       "      <td>8.3014</td>\n",
       "      <td>358500</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-122.24</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52</td>\n",
       "      <td>1467</td>\n",
       "      <td>190.0</td>\n",
       "      <td>496</td>\n",
       "      <td>177</td>\n",
       "      <td>7.2574</td>\n",
       "      <td>352100</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-122.25</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52</td>\n",
       "      <td>1274</td>\n",
       "      <td>235.0</td>\n",
       "      <td>558</td>\n",
       "      <td>219</td>\n",
       "      <td>5.6431</td>\n",
       "      <td>341300</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-122.25</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52</td>\n",
       "      <td>1627</td>\n",
       "      <td>280.0</td>\n",
       "      <td>565</td>\n",
       "      <td>259</td>\n",
       "      <td>3.8462</td>\n",
       "      <td>342200</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20635</th>\n",
       "      <td>-121.09</td>\n",
       "      <td>39.48</td>\n",
       "      <td>25</td>\n",
       "      <td>1665</td>\n",
       "      <td>374.0</td>\n",
       "      <td>845</td>\n",
       "      <td>330</td>\n",
       "      <td>1.5603</td>\n",
       "      <td>78100</td>\n",
       "      <td>INLAND</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20636</th>\n",
       "      <td>-121.21</td>\n",
       "      <td>39.49</td>\n",
       "      <td>18</td>\n",
       "      <td>697</td>\n",
       "      <td>150.0</td>\n",
       "      <td>356</td>\n",
       "      <td>114</td>\n",
       "      <td>2.5568</td>\n",
       "      <td>77100</td>\n",
       "      <td>INLAND</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20637</th>\n",
       "      <td>-121.22</td>\n",
       "      <td>39.43</td>\n",
       "      <td>17</td>\n",
       "      <td>2254</td>\n",
       "      <td>485.0</td>\n",
       "      <td>1007</td>\n",
       "      <td>433</td>\n",
       "      <td>1.7000</td>\n",
       "      <td>92300</td>\n",
       "      <td>INLAND</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20638</th>\n",
       "      <td>-121.32</td>\n",
       "      <td>39.43</td>\n",
       "      <td>18</td>\n",
       "      <td>1860</td>\n",
       "      <td>409.0</td>\n",
       "      <td>741</td>\n",
       "      <td>349</td>\n",
       "      <td>1.8672</td>\n",
       "      <td>84700</td>\n",
       "      <td>INLAND</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20639</th>\n",
       "      <td>-121.24</td>\n",
       "      <td>39.37</td>\n",
       "      <td>16</td>\n",
       "      <td>2785</td>\n",
       "      <td>616.0</td>\n",
       "      <td>1387</td>\n",
       "      <td>530</td>\n",
       "      <td>2.3886</td>\n",
       "      <td>89400</td>\n",
       "      <td>INLAND</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>20640 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \\\n",
       "0        -122.23     37.88                  41          880           129.0   \n",
       "1        -122.22     37.86                  21         7099          1106.0   \n",
       "2        -122.24     37.85                  52         1467           190.0   \n",
       "3        -122.25     37.85                  52         1274           235.0   \n",
       "4        -122.25     37.85                  52         1627           280.0   \n",
       "...          ...       ...                 ...          ...             ...   \n",
       "20635    -121.09     39.48                  25         1665           374.0   \n",
       "20636    -121.21     39.49                  18          697           150.0   \n",
       "20637    -121.22     39.43                  17         2254           485.0   \n",
       "20638    -121.32     39.43                  18         1860           409.0   \n",
       "20639    -121.24     39.37                  16         2785           616.0   \n",
       "\n",
       "       population  households  median_income  median_house_value  \\\n",
       "0             322         126         8.3252              452600   \n",
       "1            2401        1138         8.3014              358500   \n",
       "2             496         177         7.2574              352100   \n",
       "3             558         219         5.6431              341300   \n",
       "4             565         259         3.8462              342200   \n",
       "...           ...         ...            ...                 ...   \n",
       "20635         845         330         1.5603               78100   \n",
       "20636         356         114         2.5568               77100   \n",
       "20637        1007         433         1.7000               92300   \n",
       "20638         741         349         1.8672               84700   \n",
       "20639        1387         530         2.3886               89400   \n",
       "\n",
       "      ocean_proximity  \n",
       "0            NEAR BAY  \n",
       "1            NEAR BAY  \n",
       "2            NEAR BAY  \n",
       "3            NEAR BAY  \n",
       "4            NEAR BAY  \n",
       "...               ...  \n",
       "20635          INLAND  \n",
       "20636          INLAND  \n",
       "20637          INLAND  \n",
       "20638          INLAND  \n",
       "20639          INLAND  \n",
       "\n",
       "[20640 rows x 10 columns]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dup[~df_dup.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise\n",
    "Try generating  unique values in the median age from the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[...]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional Resources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [Pandas AsType](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)\n",
    "- [Pandas Duplicated Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}